4 Main Pipeline Functions
4.1 Settings File Parameters
The IsoMatchMS package requires a “Settings” file in .xlsx format. Some examples
of this format are provided in the inst/extdata directory. The example files show the required
parameters, suggestions for how often each will need to be changed between runs, and
a description of each parameter. We’ll take a look at one example here.
SettingsFile <- xlsx::read.xlsx(system.file("extdata", "Intact_Protein_Defaults.xlsx", package = "IsoMatchMS"), 1)
knitr::kable(SettingsFile)
| Parameter | Default | Update | Description |
|---|---|---|---|
| MZRange | 2500-20000 | Everytime | A range of MZ values to filter the data by. It is highly recommended that users visualize the spectra first to determine a reasonable cutoff range. |
| NoiseFilter | 5 | Everytime | An abundance (every peak is scaled to the largest peak) cutoff for peaks. A reasonable value should be in the 2.5 - 10% range. If many peaks are matched to noise, increase this value. |
| Charges | 1,2 | Everytime | The range of charges to test. List charges separated by a comma |
| AbundanceThreshold | 50 | Occasionally | The +/- percent abundance an isotope peak can vary and still be considered a match. If 50%, and the calculated abundance is 3, the matched abundance can vary from 1.5-4.5 |
| CorrelationMinimum | 0.95 | Occasionally | The minimum correlation value to consider when generating the trelliscope display |
| PPMThreshold | 10 | Occasionally | The maximum m/z error permitted. |
| AdductLabels | proton | Occasionally | Labels for the AdductMasses. Should be separated by a comma with no space (ex. proton,sodium) |
| AdductMasses | 1.00727647 | Occasionally | Masses for the Adducts. Should be separated by a comma with no space (ex. 1.00727647,22.98977) |
| AddMAI | FALSE | Rarely | Add most abundant isotope to the molecular formula calculation step. Warning: This will slow down the tool. |
| IsotopeMinimum | 5 | Rarely | The minimum number of isotopes to consider. We recommend 5 for intact proteomics, and 2 or 3 otherwise. |
| PlottingWindow | 2 | Rarely | The -/+ m/z value on either side of the matched spectra plot. Default is 2 m/z. |
| IsotopingAlgorithm | Rdisop | Rarely | Either “Rdisop” or “isopat”. “Rdisop” is more accurate and recommended, though may crash on windows OS. “isopat” may then be used as an alternative. |
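Several of the defaults above are stored as delimited text (for example, MZRange as "2500-20000" and Charges as "1,2"). As a minimal sketch of how such values could be turned into R vectors, the snippet below rebuilds a few rows of the table by hand. The helper `get_setting` is hypothetical, and this is not necessarily how IsoMatchMS parses the file internally.

```r
# A few rows of the settings table, typed in by hand for illustration
settings <- data.frame(
  Parameter = c("MZRange", "Charges", "AdductLabels", "AdductMasses"),
  Default = c("2500-20000", "1,2", "proton", "1.00727647"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the Default value for one parameter
get_setting <- function(name) {
  settings$Default[settings$Parameter == name]
}

# "2500-20000" becomes a numeric vector of length 2 (lower, upper)
mz_range <- as.numeric(strsplit(get_setting("MZRange"), "-", fixed = TRUE)[[1]])

# "1,2" becomes an integer vector of charges
charges <- as.integer(strsplit(get_setting("Charges"), ",", fixed = TRUE)[[1]])

# Adduct labels and masses are paired by position
adducts <- setNames(
  as.numeric(strsplit(get_setting("AdductMasses"), ",", fixed = TRUE)[[1]]),
  strsplit(get_setting("AdductLabels"), ",", fixed = TRUE)[[1]]
)
```

Pairing labels and masses by position is why the table asks for comma-separated lists with no spaces: both lists must have the same length and order.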
4.2 Creating a Peak_Data Object
Once the optional step of summing the spectra has been completed, the pspecterlib package can be used to create a peak_data object. This can be done with a .mzML file or with two vectors with MZ and Intensity values.
This is an example of reading in a summed mzML file. First, a scan_metadata object is created
using the pspecterlib::get_scan_metadata function. That output is then passed to the
pspecterlib::get_peak_data function to create the peak_data object.
# Extract the scan metadata
scan_data <- pspecterlib::get_scan_metadata(
MSPath = system.file("extdata", "Intact_Protein_Summed_MS1.mzML", package = "IsoMatchMS")
)
# Create the peak_data object
peak_data <- pspecterlib::get_peak_data(
ScanMetadata = scan_data,
ScanNumber = 1,
MinAbundance = 0.1
)
head(peak_data)
#> M/Z Intensity Abundance
#> 1: 2499.990 1.298022 0.1182
#> 2: 2500.007 2.871200 0.2615
#> 3: 2500.024 4.821795 0.4391
#> 4: 2500.041 6.556417 0.5970
#> 5: 2500.057 7.286436 0.6635
#> 6: 2500.074 7.886676 0.7182
In this example, MZ and Intensity data are read in from a .csv, and the vectors
in the data.frame are fed into pspecterlib::make_peak_data in order to make the peak_data object.
#Reading in .csv
pd_df <- read.csv(system.file("extdata", "Peptides_PeakData.csv", package = "IsoMatchMS"))
#Making the peak_data object
peak_data <- pspecterlib::make_peak_data(
MZ = pd_df$M.Z,
Intensity = pd_df$Intensity
)
head(peak_data)
4.3 Molecular Formulas
This step, like the remaining functions outlined in this
vignette, is wrapped in the run_proteomatch function. Return to section 1 to see
examples of running this function.
Even if molecular formulas are provided, the calculate_molform()
function is required to create an IsoMatchMS_MolForm class object. Regardless of input format,
the IsoMatchMS_MolForm object will be a data.table with 9 columns: Biomolecules,
Identifiers, Adduct Names, Adduct Masses, Charges, Molecular Formulas, Mass Shifts, Monoisotopic Masses, and
Most Abundant Isotopes.
All of these values are calculated from a combination of the ProForma strings or molecular formulas,
charges, and adducts. If ProForma sequences are provided, they are trimmed to
the characters between the first and second period. Parentheses are removed,
and values within square brackets are extracted as post-translational modifications (PTMs).
Molecular formulas, mass shifts, adducts, and charges are all tracked in this data.table.
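The trimming steps described above can be sketched in base R. This is an illustration only: `parse_proforma_sketch` is a hypothetical helper, not the parser used by calculate_molform(), and it assumes modification names contain no periods.

```r
# Hypothetical helper illustrating the described ProForma trimming steps
parse_proforma_sketch <- function(proforma) {
  # Keep the text between the first and second period, if both are present
  parts <- strsplit(proforma, ".", fixed = TRUE)[[1]]
  core <- if (length(parts) >= 3) parts[2] else proforma

  # Remove parentheses
  core <- gsub("[()]", "", core)

  # Extract the contents of each pair of square brackets as a modification
  tags <- regmatches(core, gregexpr("\\[[^]]*\\]", core))[[1]]
  modifications <- gsub("^\\[|\\]$", "", tags)

  # The bare sequence is what remains once the bracketed tags are dropped
  sequence <- gsub("\\[[^]]*\\]", "", core)

  list(sequence = sequence, modifications = modifications)
}

parsed <- parse_proforma_sketch("M.S[Methyl]S[22]S[23].V")
parsed$sequence
parsed$modifications
```

A modification written as a decimal mass (one containing a period) would be split incorrectly by this sketch, so a real parser needs to locate the flanking periods more carefully.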
Here is an example using peptides written as ProForma strings.
# Run two examples with two charge states
MolForm <- calculate_molform(
Biomolecules = c("M.SS[Methyl]S.V", "M.S[Methyl]S[22]S[23].V"),
BioType = "ProForma",
Charge = 1:2
)
MolForm %>% knitr::kable()
| Biomolecules | Identifiers | Adduct Name | Adduct Mass | Charge | Molecular Formula | Mass Shift | Monoisotopic Mass | Most Abundant Isotope |
|---|---|---|---|---|---|---|---|---|
| M.SS[Methyl]S.V | NA | proton | 1.007276 | 1 | H19C10N3O7 | 0 | 294.1296 | NA |
| M.SS[Methyl]S.V | NA | proton | 1.007276 | 2 | H19C10N3O7 | 0 | 147.5684 | NA |
| M.S[Methyl]S[22]S[23].V | NA | proton | 1.007276 | 1 | H19C10N3O7 | 45 | 339.1296 | NA |
| M.S[Methyl]S[22]S[23].V | NA | proton | 1.007276 | 2 | H19C10N3O7 | 45 | 170.0684 | NA |
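The Monoisotopic Mass values in the table above change with charge, which suggests they are reported as m/z values. They appear consistent with (M + mass shift + z × adduct mass) / z, where M is the neutral monoisotopic mass. The sketch below checks that arithmetic in base R. The element masses are standard monoisotopic masses, `mz_sketch` is a hypothetical helper, and the formula is inferred from the table; it is not taken from the package source.

```r
# Standard monoisotopic masses of the elements in H19C10N3O7
monoisotopic <- c(H = 1.00782503207, C = 12, N = 14.0030740048, O = 15.99491461956)
counts <- c(H = 19, C = 10, N = 3, O = 7)

# Neutral monoisotopic mass of the molecular formula
neutral_mass <- sum(monoisotopic[names(counts)] * counts)

# Hypothetical helper: m/z for a given charge, adduct mass, and mass shift
mz_sketch <- function(neutral_mass, charge, adduct_mass = 1.00727647, mass_shift = 0) {
  (neutral_mass + mass_shift + charge * adduct_mass) / charge
}

# Charge states 1 and 2 with no mass shift, then with a mass shift of 45
round(mz_sketch(neutral_mass, 1:2), 4)
round(mz_sketch(neutral_mass, 1:2, mass_shift = 45), 4)
```

Under these assumptions the results agree with the four rows of the table to the displayed precision.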
To access the current modifications database, use:
# Load backend glossary
Glossary <- data.table::fread(
system.file("extdata", "Unimod_v20220602.csv", package = "pspecterlib")
)
Glossary %>%
head() %>%
dplyr::select(Modification, `Mass Change`, Residues, H, C, O, N, S) %>%
knitr::kable()
To request that a new modification be added, open an issue on the pspecterlib GitHub issues page.
4.4 Filter Peaks
The filter_peaks() function allows users to restrict the data to a particular
range of M/Z values and to filter out noise, which
improves and speeds up the identification process. Larger fragment data (like top-down proteomic data)
should have a higher noise filter than small fragment data (bottom-up), since
intact data contain many low-abundance peaks.
The abundance values are each peak’s intensity as a percentage of the largest peak’s intensity. If we set a noise filter of 5, it will remove any peaks with an abundance less than 5% of the highest intensity. As a general rule, if too many peaks are matched, try raising the noise filter, and if too few are matched, try lowering it.
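To make the two filters concrete, here is a minimal base R sketch of the logic described above. `filter_peaks_sketch` is a hypothetical stand-in and not the package's filter_peaks() implementation. In particular, it assumes abundance is scaled against the largest peak before the M/Z range is applied, which may differ from what the package does.

```r
# Hypothetical stand-in for the filtering logic, for illustration only
filter_peaks_sketch <- function(peaks, mz_range, noise_filter) {
  # Abundance: each intensity as a percentage of the largest intensity
  peaks$Abundance <- peaks$Intensity / max(peaks$Intensity) * 100

  # Keep peaks inside the M/Z range with abundance at or above the cutoff
  keep <- peaks$MZ >= mz_range[1] &
    peaks$MZ <= mz_range[2] &
    peaks$Abundance >= noise_filter
  peaks[keep, , drop = FALSE]
}

# Toy spectrum: five peaks with decreasing intensity
toy_peaks <- data.frame(
  MZ = c(294.1296, 295.1325, 296.1343, 297.1369, 298.1390),
  Intensity = c(868.3680036, 110.9431876, 18.7179196, 1.7871629, 0.1701294)
)

filtered <- filter_peaks_sketch(toy_peaks, mz_range = c(290, 300), noise_filter = 5)
filtered
```

With a noise filter of 5, only the two most intense peaks survive, because the third peak has an abundance of roughly 2% of the largest peak.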
4.5 Match Proteoform to MS1
Now, we pass the peak_data object (which does not need to be filtered) and the molecular formula
data.table (IsoMatchMS_MolForm) to the matching function. See
?match_biomolecule_to_ms1 for a more detailed explanation of the parameters.
# Run two examples with two charge states
MolForms_Test <- calculate_molform(
Biomolecules = c("M.SS[Methyl]S.V", "M.SS[6]S[7].V"),
BioType = "ProForma",
Identifiers = c("Test1", "Test2"),
Charge = 1:2
)
# Generate some experimental peak data to match
PeakData <- pspecterlib::make_peak_data(
MZ = c(147.5684, 148.0699, 148.5708, 149.0721, 149.5731,
294.1296, 295.1325, 296.1343, 297.1369, 298.1390),
Intensity = c(868.3680036, 110.9431876, 18.7179196, 1.7871629, 0.1701294,
868.3680036, 110.9431876, 18.7179196, 1.7871629, 0.1701294)
)
# Run algorithm
Matches <- match_biomolecule_to_ms1(
PeakData = PeakData,
MolecularFormula = MolForms_Test,
IsotopeMinimum = 2
)
Matches %>% knitr::kable()
| Identifiers | Adduct Mass | Adduct Name | M/Z | Mass Shift | Monoisotopic Mass | Abundance | Isotope | M/Z Search Window | M/Z Experimental | Intensity Experimental | Abundance Experimental | PPM Error | Absolute Relative Error | Pearson Correlation | Charge | Biomolecules | Molecular Formula | Most Abundant Isotope | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test1 | 1.007276 | proton | 147.5684 | 0 | 147.5684 | 100.000000 | 0 | 0.0014757 | 147.5684 | 868.36800 | 100.0000 | -0.1796489 | 5.5e-06 | 1 | 2 | M.SS[Methyl]S.V | H19C10N3O7 | NA | 1 |
| Test1 | 1.007276 | proton | 148.0699 | 0 | 147.5684 | 12.776057 | 1 | 0.0014807 | 148.0699 | 110.94319 | 12.7761 | 0.1827338 | 5.5e-06 | 1 | 2 | M.SS[Methyl]S.V | H19C10N3O7 | NA | 1 |
| Test1 | 1.007276 | proton | 148.5708 | 0 | 147.5684 | 2.155528 | 2 | 0.0014857 | 148.5708 | 18.71792 | 2.1555 | -0.0688777 | 5.5e-06 | 1 | 2 | M.SS[Methyl]S.V | H19C10N3O7 | NA | 1 |
| Test1 | 1.007276 | proton | 294.1296 | 0 | 294.1296 | 100.000000 | 0 | 0.0029413 | 294.1296 | 868.36800 | 100.0000 | 0.0797234 | 5.5e-06 | 1 | 1 | M.SS[Methyl]S.V | H19C10N3O7 | NA | 2 |
| Test1 | 1.007276 | proton | 295.1325 | 0 | 294.1296 | 12.776057 | 1 | 0.0029513 | 295.1325 | 110.94319 | 12.7761 | 0.1036306 | 5.5e-06 | 1 | 1 | M.SS[Methyl]S.V | H19C10N3O7 | NA | 2 |
| Test1 | 1.007276 | proton | 296.1343 | 0 | 294.1296 | 2.155528 | 2 | 0.0029613 | 296.1343 | 18.71792 | 2.1555 | -0.1485691 | 5.5e-06 | 1 | 1 | M.SS[Methyl]S.V | H19C10N3O7 | NA | 2 |
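The M/Z Search Window column in the table above appears consistent with a tolerance of 10 ppm of the theoretical M/Z (the PPMThreshold default). The sketch below shows that arithmetic along with a common definition of ppm error. Both helpers are hypothetical, and the sign convention used by the package for PPM Error is an assumption here.

```r
# Hypothetical helper: half-width of the search window for a ppm tolerance
ppm_window <- function(mz, ppm = 10) {
  mz * ppm / 1e6
}

# Hypothetical helper: signed ppm error of an experimental M/Z
# (sign convention assumed: positive when the experimental value is higher)
ppm_error <- function(mz_experimental, mz_theoretical) {
  (mz_experimental - mz_theoretical) / mz_theoretical * 1e6
}

# Windows for the first peak of each charge state in the table
round(ppm_window(c(147.5684, 294.1296)), 7)

# A peak observed 0.001 M/Z above a theoretical value of 294.1296
ppm_error(294.1306, 294.1296)
```

Because the window scales with M/Z, the same ppm threshold allows a wider absolute tolerance for heavier ions.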
Each match is generated with a unique ID for plotting purposes. The table above reports two metrics of peak match quality: Absolute Relative Error and Pearson Correlation.
The equation for Absolute Relative Error is:
\[ \frac{1}{n} \sum_{i=1}^{n} \frac{|A_{R,i} - A_{E,i}|}{A_{R,i}} \]
where \(n\) is the number of peaks matched, \(A_{R,i}\) is the reference abundance of peak \(i\), and \(A_{E,i}\) is the experimental abundance of peak \(i\).
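As a worked example of the equation above, the sketch below computes the Absolute Relative Error from the rounded Abundance and Abundance Experimental values shown for match ID 1 in the table. The variable names are illustrative, and because the inputs are the rounded table values, the result only approximates what the package computes from unrounded data.

```r
# Reference and experimental abundances for match ID 1, as displayed in the table
reference <- c(100, 12.776057, 2.155528)
experimental <- c(100, 12.7761, 2.1555)

# Mean of the per-peak absolute relative differences
absolute_relative_error <- mean(abs(reference - experimental) / reference)

signif(absolute_relative_error, 2)
```

The result is on the order of 5e-06, in line with the Absolute Relative Error column of the table.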
4.6 Plot Results
A single match can be visualized with
plot_Ms1Match. Many plotting options can be explored with ?plot_Ms1Match.
plot_Ms1Match(
PeakData = PeakData,
Ms1Match = Matches,
ID = 1 # Pull whatever match you're interested in plotting
)
If there are multiple matches, users can build a trelliscope
display with isomatchms_trelliscope.
isomatchms_trelliscope(
PeakData = PeakData,
Ms1Match = Matches,
Path = "~/Downloads/TrelliTest"
)